Don't fail Supervisor setup when an app image is missing #6816

Merged
agners merged 3 commits into main from improve-startup-missing-container-image-handling
May 20, 2026

@agners
Member

@agners agners commented May 6, 2026

Proposed change

A missing builder image (e.g. docker:29.4.2-cli, when the host's exact Docker patch version has no matching -cli tag published on Docker Hub) during a build-required app load aborted Supervisor setup entirely. The system was left in setup state where every subsequent operation was blocked by the not-healthy guard. Recovery required either Docker Hub publishing the tag or a manual workaround.

Two issues compounded the failure:

  1. images.pull in DockerAPI.run_command leaked a raw aiodocker.DockerError past the @Job decorator. Since aiodocker.DockerError is not a HassioError, the decorator rewrapped it as JobException, which then bypassed the suppress(DockerError, ...) guard in App.load() that was designed to keep one bad app from killing setup.
  2. App.load() treated all Docker errors the same — a 404 "image not in cache" was indistinguishable from a "daemon is sick" 5xx, so a real install attempt could fall through into the with suppress(...) and silently succeed-or-fail without surfacing anything to the user.
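
The first issue can be reproduced in isolation. The sketch below is not Supervisor's code: the exception classes are stand-ins, and `job`, `pull_raw`, `pull_wrapped`, and `load` are hypothetical names. It assumes a decorator that passes `HassioError` subclasses through and rewraps everything else, as described above, and shows why the rewrapped exception escapes a `suppress(DockerError)` guard.

```python
import asyncio
from contextlib import suppress
from functools import wraps


class HassioError(Exception):
    """Stand-in for Supervisor's base exception."""


class DockerError(HassioError):
    """Stand-in for Supervisor's own DockerError."""


class JobException(HassioError):
    """Stand-in for the exception the job decorator raises."""


class ForeignDockerError(Exception):
    """Stand-in for aiodocker.DockerError, which is not a HassioError."""


def job(method):
    """Simplified decorator: pass HassioError through, rewrap anything else."""

    @wraps(method)
    async def wrapper(*args, **kwargs):
        try:
            return await method(*args, **kwargs)
        except HassioError:
            raise
        except Exception as err:
            raise JobException() from err

    return wrapper


@job
async def pull_raw():
    # The foreign error leaks out of the method unchanged.
    raise ForeignDockerError(404, "manifest unknown")


@job
async def pull_wrapped():
    # The foreign error is converted before it reaches the decorator.
    try:
        raise ForeignDockerError(404, "manifest unknown")
    except ForeignDockerError as err:
        raise DockerError(f"Can't pull image: {err}") from err


async def load(pull) -> str:
    # Guard meant to keep one bad app from aborting setup.
    with suppress(DockerError):
        await pull()
    return "setup continues"
```

With these stand-ins, `asyncio.run(load(pull_wrapped))` returns normally, while `asyncio.run(load(pull_raw))` raises `JobException`, because `JobException` is not a subclass of `DockerError`.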

This PR addresses both:

  • Wrap the pull error in run_command so it propagates as Supervisor's DockerError (a HassioError) and is preserved unchanged by the @Job decorator.
  • Distinguish 404s in DockerInterface.attach() and DockerInterface.check_image() by raising DockerNotFound/DockerAPIError instead of generic DockerError.
  • In App.load(), only the DockerNotFound path is treated as "image missing":
    • For build-required apps the inline build is skipped and a MISSING_IMAGE repair (with EXECUTE_REPAIR suggestion) is created so the resolution autofix loop handles it off the setup critical path.
    • For pull-based apps the install is still attempted during load and the repair is created on failure, preserving the existing recovery behavior.
  • Other DockerErrors (daemon trouble, or a failed internal install inside check_image's arch-mismatch path) are logged at CRITICAL — which the Sentry logging integration captures as an event — and the app is left detached. We deliberately do not raise a MISSING_IMAGE repair in that case because it would promise a fix the autofix can't deliver (those errors are not resolved by retrying install()).
  • In FixupAppExecuteRepair, swallow DockerBuildError, DockerNoSpaceOnDevice, DockerRegistryAuthError, and DockerRegistryRateLimitExceeded as ResolutionFixupError so they don't generate a Sentry event on every retry. The repair stays available for manual retry once the underlying cause (registry tag published, disk freed, credentials fixed, rate limit expired) is resolved.
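
The decision logic in the bullets above can be sketched roughly as follows. This is an illustration only: the exception classes are stand-ins, `classify`, `action_for`, and `LoadAction` are hypothetical names, and it assumes that a 404 maps to `DockerNotFound` and every other status to `DockerAPIError`, with both deriving from `DockerError`.

```python
from enum import Enum


class DockerError(Exception):
    """Stand-in for Supervisor's DockerError."""


class DockerNotFound(DockerError):
    """The daemon answered 404: the image or container does not exist."""


class DockerAPIError(DockerError):
    """Any other daemon response, for example a 5xx."""


class LoadAction(Enum):
    """Hypothetical labels for the three outcomes described in the PR."""

    CREATE_REPAIR = "skip inline build, create MISSING_IMAGE repair"
    INSTALL_NOW = "attempt install, create repair on failure"
    LOG_CRITICAL = "log at CRITICAL, leave app detached"


def classify(status: int, message: str) -> DockerError:
    """Map an HTTP status from the daemon to a specific exception."""
    if status == 404:
        return DockerNotFound(message)
    return DockerAPIError(message)


def action_for(err: DockerError, build_required: bool) -> LoadAction:
    """Pick what load should do after attach or check_image failed."""
    if isinstance(err, DockerNotFound):
        # Only a clean 404 counts as "image missing".
        if build_required:
            return LoadAction.CREATE_REPAIR
        return LoadAction.INSTALL_NOW
    # Daemon trouble: retrying install would not help, so no repair is raised.
    return LoadAction.LOG_CRITICAL
```

For example, `action_for(classify(404, "No such image"), build_required=True)` yields `LoadAction.CREATE_REPAIR`, whereas a 500 from the daemon yields `LoadAction.LOG_CRITICAL` regardless of how the app gets its image.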

Type of change

  • Dependency upgrade
  • Bugfix (non-breaking change which fixes an issue)
  • New feature (which adds functionality to the supervisor)
  • Breaking change (fix/feature causing existing functionality to break)
  • Code quality improvements to existing code or addition of tests

Additional information

  • This PR fixes or closes issue: fixes #
  • This PR is related to issue:
  • Link to documentation pull request:
  • Link to cli pull request:
  • Link to client library pull request:

Checklist

  • The code change is tested and works locally.
  • Local tests pass. Your PR cannot be merged unless tests pass
  • There is no commented out code in this PR.
  • I have followed the development checklist
  • The code has been formatted using Ruff (ruff format supervisor tests)
  • Tests have been added to verify that the new code works.

If API endpoints or add-on configuration are added/changed:

@agners agners added the bugfix A bug fix label May 6, 2026
@agners
Member Author

agners commented May 6, 2026

Stack trace of the original issue:
2026-05-06 14:11:01.499 ERROR (MainThread) [supervisor.jobs] Unhandled exception: [404] manifest for docker:29.4.2-cli not found: manifest unknown: manifest unknown
2026-05-06 14:11:01.499 ERROR (MainThread) [supervisor.jobs] Unhandled exception: [404] manifest for docker:29.4.2-cli not found: manifest unknown: manifest unknown
Traceback (most recent call last):
  File "/usr/src/supervisor/supervisor/addons/addon.py", line 257, in load
    await self.instance.attach(version=self.version)
  File "/usr/src/supervisor/supervisor/jobs/decorator.py", line 307, in wrapper
    raise err
  File "/usr/src/supervisor/supervisor/jobs/decorator.py", line 299, in wrapper
    return await method(obj, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/supervisor/supervisor/docker/interface.py", line 457, in attach
    raise DockerError(
        f"Could not get metadata on container or image for {self.name}"
    )
supervisor.exceptions.DockerError: Could not get metadata on container or image for addon_f4f71350_ewelink_smart_home_slug

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/src/supervisor/supervisor/docker/manager.py", line 641, in run_command
    await self.images.inspect(f"{image}:{tag}")
  File "/usr/local/lib/python3.14/site-packages/aiodocker/images.py", line 48, in inspect
    response = await self.docker._query_json(f"images/{name}/json")
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.14/site-packages/aiodocker/docker.py", line 541, in _query_json
    async with self._query(
               ~~~~~~~~~~~^
        path,
        ^^^^^
    ...<6 lines>...
        versioned_api=versioned_api,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ) as response:
    ^
  File "/usr/local/lib/python3.14/contextlib.py", line 214, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.14/site-packages/aiodocker/docker.py", line 433, in _query
    yield await self._do_query(
          ^^^^^^^^^^^^^^^^^^^^^
    ...<9 lines>...
    )
    ^
  File "/usr/local/lib/python3.14/site-packages/aiodocker/docker.py", line 514, in _do_query
    raise DockerError(response.status, data["message"])
aiodocker.exceptions.DockerError: [404] No such image: docker:29.4.2-cli

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/src/supervisor/supervisor/jobs/decorator.py", line 299, in wrapper
    return await method(obj, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/supervisor/supervisor/docker/addon.py", line 673, in install
    await self._build(version, image)
  File "/usr/src/supervisor/supervisor/docker/addon.py", line 739, in _build
    result = await self.sys_docker.run_command(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<4 lines>...
    )
    ^
  File "/usr/src/supervisor/supervisor/docker/manager.py", line 645, in run_command
    await self.images.pull(image, tag=tag)
  File "/usr/local/lib/python3.14/site-packages/aiodocker/images.py", line 154, in _handle_list
    async with cm as response:
               ^^
  File "/usr/local/lib/python3.14/contextlib.py", line 214, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.14/site-packages/aiodocker/docker.py", line 433, in _query
    yield await self._do_query(
          ^^^^^^^^^^^^^^^^^^^^^
    ...<9 lines>...
    )
    ^
  File "/usr/local/lib/python3.14/site-packages/aiodocker/docker.py", line 514, in _do_query
    raise DockerError(response.status, data["message"])
aiodocker.exceptions.DockerError: [404] manifest for docker:29.4.2-cli not found: manifest unknown: manifest unknown

With this change, the Supervisor handles this error (and other similar ones) more gracefully:

2026-05-06 14:43:08.842 INFO (MainThread) [supervisor.addons.addon] No f4f71350_ewelink_smart_home_slug app Docker image f4f71350/amd64-addon-ewelink_smart_home_slug found
2026-05-06 14:43:08.842 INFO (MainThread) [supervisor.resolution.module] Create new suggestion execute_repair - addon / f4f71350_ewelink_smart_home_slug
2026-05-06 14:43:08.842 INFO (MainThread) [supervisor.resolution.module] Create new issue missing_image - addon / f4f71350_ewelink_smart_home_slug
...
2026-05-06 15:43:13.948 ERROR (MainThread) [supervisor.docker.manager] Can't pull image docker:29.4.2-cli: [404] manifest for docker:29.4.2-cli not found: manifest unknown: manifest unknown
2026-05-06 15:43:13.948 ERROR (MainThread) [supervisor.docker.addon] Can't build f4f71350/amd64-addon-ewelink_smart_home_slug:1.4.6: Can't pull image docker:29.4.2-cli: [404] manifest for docker:29.4.2-cli not found: manifest unknown: manifest unknown
...
2026-05-06 15:43:13.949 WARNING (MainThread) [supervisor.resolution.fixup] Error during processing execute_repair: Can't build f4f71350/amd64-addon-ewelink_smart_home_slug:1.4.6: Can't pull image docker:29.4.2-cli: [404] manifest for docker:29.4.2-cli not found: manifest unknown: manifest unknown

Member

@sairon sairon left a comment

> A missing builder image (e.g. docker:29.4.2-cli, when the host's exact Docker patch version has no matching -cli tag published on Docker Hub)

Note that this is a situation that shouldn't normally happen. But because Docker messed up something in their packaging, their CI is failing and the 29.4.2 images are missing on Docker Hub. Normally the images are published within a day of the release. Also, the OS build will fail too if those images are not published, so this only affects early adopters on Supervised and dev environments.

@RubenNL
Contributor

RubenNL commented May 6, 2026

For everyone who found this issue and needs a quick workaround:

docker pull docker:29.4.1-cli
docker tag docker:29.4.1-cli docker:29.4.2-cli

Of course, this is an ugly fix and shouldn't be used long term. To remove the tag, just run docker image rm docker:29.4.2-cli

@sairon
Member

sairon commented May 7, 2026

29.4.3 was released yesterday and it's making its way to the Docker Hub: docker-library/docker@85f8094

So hopefully it should be resolved within a day or two.

Contributor

@mdegat01 mdegat01 left a comment

I guess a comment is best here. I don't know if changes are needed, but I do need to confirm some things, namely whether we actually want to delay image builds until after setup or just want to keep the exceptions from breaking setup. Currently we're only doing the latter with this change: image builds will still occur during setup, just at a different step.

Comment thread supervisor/addons/addon.py Outdated
Comment on lines +277 to +280
# Docker error other than a clean "image not found" - we can't
# tell whether the image is actually missing. Log and leave the
# addon detached; a future load will reattempt and surface a
# MISSING_IMAGE repair if appropriate.
Contributor

This references a "future load" that will fix this. But there is no future load. There are only three places where we call load right now:

  1. Startup of Supervisor we call it for each installed addon
  2. On install of a new app
  3. On restore of an app (but only if it was newly installed, so really this is still 2)

Could be just this comment is incorrect which is nbd but wanted to make sure there wasn't a misunderstanding. Currently if this load/attach process fails there is no fallback/retry mechanism in place. If we need that now, we have to add that.

Member Author

Right, this needs manual interaction. We don't raise a repair for this issue currently, and I don't think it's worth the effort since this is a corner case. Affected users can just go to the app page and trigger a rebuild 🤷.

# Dockerfile or unavailable base/builder image; disk full; bad
# credentials; registry rate limit). Surface as a fixup error so
# FixupBase swallows it without a Sentry event. The repair stays
# available for manual retry once the underlying cause is fixed.
Contributor

Something I realized looking at this: only one instance of this class is ever created. It's created at Supervisor startup and we use the same instance for the entire time Supervisor is running. So self.attempts is never reset; once you get 5 failures, this is just a manual fixup until Supervisor is restarted.

If we're trying to improve this fixup, maybe we want to reset attempts on success? Or give each addon its own attempts count using a dictionary? This is an existing issue, so it doesn't have to be tackled here; just noting it.
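
A per-addon counter along the lines suggested here could look roughly like the sketch below. Nothing in it is Supervisor code: `AttemptTracker` and its methods are hypothetical, and the limit of 5 is taken from the "5 failures" mentioned above.

```python
# Assumed limit, matching the "5 failures" mentioned in the comment above.
MAX_AUTO_ATTEMPTS = 5


class AttemptTracker:
    """Hypothetical helper: count failed autofix attempts per addon reference."""

    def __init__(self, max_attempts: int = MAX_AUTO_ATTEMPTS) -> None:
        self._max_attempts = max_attempts
        self._attempts: dict[str, int] = {}

    def can_auto(self, reference: str) -> bool:
        """Return True while this addon still has automatic attempts left."""
        return self._attempts.get(reference, 0) < self._max_attempts

    def record_failure(self, reference: str) -> None:
        """Count one failed attempt for this addon only."""
        self._attempts[reference] = self._attempts.get(reference, 0) + 1

    def record_success(self, reference: str) -> None:
        """Reset the counter once the repair worked."""
        self._attempts.pop(reference, None)
```

With a structure like this, one addon exhausting its attempts would not turn the fixup into a manual-only repair for every other addon, and a successful repair would restore automatic retries for that addon.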

Member Author

Yeah, since this is rather a corner-case issue, I'd prefer not to add more complexity.

Comment on lines +265 to +268
# Don't run a local build during setup. Surface a repair so
# the resolution autofix loop can handle it off the critical
# path.
self._create_missing_image_issue()
Contributor

@mdegat01 mdegat01 May 8, 2026

This won't actually move this logic out of setup btw. It will move it out of setup of AddonManager but ResolutionManager is loaded afterwards here:

self.sys_resolution.load(),

And as part of its load it runs a healthcheck which then runs autofixes. If your goal is simply to prevent exceptions raised from building from breaking setup then that still accomplishes that, since exceptions raised by autofix fixups won't break setup. But if your goal is to stop setup from waiting for images to be built then you should adjust this logic:

    @property
    def auto(self) -> bool:
        """Return if a fixup can be apply as auto fix."""
        return self.attempts < MAX_AUTO_ATTEMPTS

To something like this:

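    @property
    def auto(self) -> bool:
        """Return if a fixup can be apply as auto fix."""
        # Compare against the single state directly; use `not in (...)`
        # only with an explicit tuple of states.
        return (
            self.sys_core.state != CoreState.SETUP
            and self.attempts < MAX_AUTO_ATTEMPTS
        )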

Or provide a fixed list of states you want CoreState to be in.

Bear in mind though, there is currently no other healthcheck between the end of SETUP and when apps are started during STARTUP. So currently any addons which exit SETUP without their image available will effectively have boot disabled, since they will fail to start during boot and will then have to be started manually afterwards, unless we add another healthcheck at the top of Core.start.

Which on that note, we should probably temporarily disable boot on any addons which we have decided we cannot download or build an image for right now. Else we'll just try again during STARTUP and fail again.

Member Author

> And as part of its load it runs a healthcheck which then runs autofixes. If your goal is simply to prevent exceptions raised from building from breaking setup then that still accomplishes that, since exceptions raised by autofix fixups won't break setup.

It is certainly the main aim of this PR.

> But if your goal is to stop setup from waiting for images to be built then you should adjust this logic:

So that came as an afterthought: how often do we even need to build on setup? I encountered this on my development system, where I had an app which no longer builds. I cleaned up Docker images at one point, which is probably why I started running into it. Once you have such a non-building app, you'll encounter it on every startup, and it will slow down the start. So I felt like we should punt on this.

I have no idea how often users run into this, probably almost never. If the build fails on install, we roll back the installation of the app, so normally users should not encounter this at all.

From what I can tell this is really a corner-case scenario, so I can live with either approach.

Member Author

> This won't actually move this logic out of setup btw. It will move it out of setup of AddonManager but ResolutionManager is loaded afterwards here:

Actually, it does: run_autofix has JobCondition.RUNNING. So the code as is already defers to running.
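
The effect of such a condition can be illustrated with a small sketch. It does not reproduce Supervisor's `@Job` decorator or `JobCondition`: `only_when_running`, `Core`, and the two-member `CoreState` below are hypothetical stand-ins, and returning `None` when the condition is not met is an assumption made for the example.

```python
import asyncio
from enum import Enum
from functools import wraps


class CoreState(str, Enum):
    """Reduced stand-in with just the two states needed here."""

    SETUP = "setup"
    RUNNING = "running"


class Core:
    """Stand-in holding the current core state."""

    def __init__(self) -> None:
        self.state = CoreState.SETUP


def only_when_running(core: Core):
    """Hypothetical condition: skip the wrapped coroutine unless RUNNING."""

    def decorator(method):
        @wraps(method)
        async def wrapper(*args, **kwargs):
            if core.state is not CoreState.RUNNING:
                # Condition not met: do nothing during setup.
                return None
            return await method(*args, **kwargs)

        return wrapper

    return decorator


core = Core()
runs: list[str] = []


@only_when_running(core)
async def run_autofix() -> bool:
    # Placeholder for the work an autofix would do.
    runs.append("fixed")
    return True
```

While `core.state` is `SETUP`, awaiting `run_autofix()` is a no-op, so a healthcheck triggered during setup would not start an image build; once the state switches to `RUNNING`, the same call executes the fix.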

Contributor

I missed that, that covers it then.

A missing builder image (docker:<version>-cli) during a build-required app
load aborted Supervisor setup entirely, leaving the system stuck in setup
state where every subsequent operation was blocked by the not-healthy
guard. Triggered in practice when the host's Docker patch version had no
matching `-cli` tag published on Docker Hub.

Two issues compounded the failure: `images.pull` in `run_command` leaked a
raw `aiodocker.DockerError` past the `@Job` decorator, which rewrapped it
as `JobException` and bypassed the `suppress(DockerError, ...)` guard in
`addon.load()`; and the load path treated all Docker errors the same
whether the image was simply missing or the daemon itself was misbehaving.

Wrap the pull error in `run_command` so it propagates as Supervisor's
`DockerError` (a `HassioError`) and is preserved by the decorator.
Distinguish 404s in `attach()` and `check_image()` by raising
`DockerNotFound`/`DockerAPIError` instead of generic `DockerError`. In
`addon.load()`, only the `DockerNotFound` path is treated as "image
missing": for build-required apps we skip the inline build and surface a
`MISSING_IMAGE` repair so the resolution autofix loop handles it off the
critical path; for pull-based apps we still attempt install during load
and create the repair on failure. Other `DockerError`s (daemon trouble or
a failed internal install in `check_image`) are logged at CRITICAL — which
the Sentry logging integration captures — and the addon is left detached
rather than masked as a misleading missing-image repair.

In the autofix path, swallow `DockerBuildError`, `DockerNoSpaceOnDevice`,
`DockerRegistryAuthError`, and `DockerRegistryRateLimitExceeded` as
`ResolutionFixupError` so they don't generate Sentry events on every
retry. The repair stays available for manual retry once the underlying
cause (registry tag published, disk freed, credentials fixed, rate limit
expired) is resolved.
@agners agners force-pushed the improve-startup-missing-container-image-handling branch from afc1165 to 183e66f Compare May 18, 2026 17:03
The comment claimed "a future load will reattempt and surface a
MISSING_IMAGE repair if appropriate", but App.load() is only called at
Supervisor startup, on fresh install, and on backup restore — there is no
automatic retry mechanism. Reword to match reality: the CRITICAL log
captures the issue for diagnostics (Sentry), and the user can trigger a
manual repair once the daemon is healthy.
@agners agners force-pushed the improve-startup-missing-container-image-handling branch from 9aa6665 to cbf75ba Compare May 18, 2026 17:25
@agners agners requested a review from mdegat01 May 18, 2026 17:27
Contributor

@mdegat01 mdegat01 left a comment

Ok looks good, LGTM 👍

@agners agners merged commit 0bcedf5 into main May 20, 2026
20 of 21 checks passed
@agners agners deleted the improve-startup-missing-container-image-handling branch May 20, 2026 15:59

4 participants